sample question
What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities
Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet growing skepticism surrounds their reliability. How can we know that a reported accuracy genuinely reflects a model's true performance? Evaluations are often presented as simple measurements, but in reality they are inferences: to treat benchmark scores as evidence of capability is already to assume a theory of what capability is and how it manifests in a test. We make this step explicit by proposing a principled framework for evaluation as inference: begin from a theory of capability, and then derive methods for estimating it. This perspective, familiar in fields such as psychometrics, has not yet become commonplace in AI evaluation. As a proof of concept, we address a central challenge that undermines reliability: sensitivity to perturbations. After formulating a model of ability, we introduce methods that infer ability while accounting for uncertainty from sensitivity and finite samples, including an adaptive algorithm that significantly reduces sample complexity. Together, these contributions lay the groundwork for more reliable and trustworthy estimates of AI capabilities as measured through benchmarks.
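The abstract does not spell out its estimator, but the core idea of inferring ability with uncertainty from perturbation sensitivity and finite samples can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes each benchmark item is scored as the model's mean accuracy over several perturbed rephrasings, and then bootstraps a confidence interval over items, so perturbation sensitivity widens the interval. The function name and the example scores are hypothetical.

```python
import random
import statistics

def ability_interval(item_scores, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for a model's mean ability.

    `item_scores`: one entry per benchmark item, each the model's mean
    accuracy across several perturbed rephrasings of that item.
    Returns (point estimate, (lower, upper)) at level 1 - alpha.
    """
    rng = random.Random(seed)
    n = len(item_scores)
    # Resample items with replacement and collect bootstrap means.
    means = sorted(
        statistics.mean(rng.choices(item_scores, k=n))
        for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.mean(item_scores), (lower, upper)

# Hypothetical per-item accuracies, each averaged over 5 perturbations:
scores = [1.0, 0.8, 0.6, 1.0, 0.4, 0.8, 0.2, 1.0, 0.6, 0.8]
est, (lo, hi) = ability_interval(scores)
```

Under this toy setup, a model that looks strong on the canonical phrasings but degrades under perturbation yields a visibly wide interval, which is the kind of sensitivity-aware uncertainty the abstract argues benchmark reports should carry.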
A novel interface for adversarial trivia question-writing
A critical component in developing question-answering AIs is an adversarial dataset that challenges models to adapt to the complex syntax and reasoning underlying natural language. Present techniques for procedurally generating adversarial text are not robust enough for training on complex tasks such as answering multi-sentence trivia questions. We instead turn to human-generated data by introducing an interface for collecting adversarial, human-written trivia questions. Our interface is aimed at question writers and players of Quiz Bowl, a buzzer-based trivia competition in which paragraph-long questions consist of a sequence of clues of decreasing difficulty. To incentivize usage, a suite of machine-learning-based tools in our interface assists humans in writing questions that are more challenging for Quiz Bowl players and computers alike. Not only does our interface gather training data for the groundbreaking Quiz Bowl AI project QANTA, but it also serves as a proof of concept for future adversarial data collection for question-answering systems. The results of performance-testing our interface with ten originally composed questions indicate that, despite some flaws, its novel question-writing features and its real-time exposure of useful responses from our machine models could facilitate and enhance the collection of adversarial questions. The code for our interface is available at: https://github.com/Zefan-Cai/QAML
The Complete Collection of Data Science Interviews – Part 1 - KDnuggets
Have you ever been in the situation where the interviewer asked you a situational or technical question and you froze up, simply because you were not prepared for it? It happens to many people, including me. I have a tendency to freeze during technical interviews, and hiring managers have taken it as a weakness and rejected me at the initial stage of the recruitment process. To overcome this problem, I started looking at sample interview questions.
British doctors go on the defensive due to 'high-performing' 'GP at Hand' app
LONDON – A medical chatbot said to perform as well as or even better than human doctors has sparked a war of words in Britain, in a clash over how much the cash-strapped public health service should rely on artificial intelligence. AI company Babylon, which is already working with the National Health Service, claimed its chatbot scored higher marks than real live doctors in "robust tests." The British firm said it quizzed the AI using sample questions for trainee exams set by Britain's Royal College of General Practitioners (RCGP), the professional body for family doctors. The programmed chatbot, a key feature of Babylon's "GP at Hand" app, scored 81 percent when sitting the test for the first time, while the average pass mark over the past five years for doctors was 72 percent, according to the company. Ali Parsa, its founder who presented the findings in London earlier this week, hailed the results as "a landmark." "(They) take humanity a significant step closer to achieving a world where no one is denied safe and accurate health advice," he said in a statement.
Could YOU pass the secretive Oxford entrance exam? University reveals some of its most common questions - and how to answer them
It's a question you might never have considered before – why do older siblings do better on IQ tests than their younger counterparts? But if you want to get into Oxford's experimental psychology program, you'd better be prepared to answer it. The university has released a series of questions from tutors who conduct the infamous interviews, revealing the complex problems, in everything from mathematics to medicine, used to spot the sharpest candidates. In total, Oxford has revealed five interview questions spanning Modern Languages, Medicine, Philosophy, Maths, and Experimental Psychology.